Run-time parallelization of loops in computer programs

ABSTRACT

Parallelization of loops is performed for loops having indirect loop index variables and embedded conditional statements in the loop body. Loops having any finite number of array variables in the loop body, and any finite number of indirect loop index variables can be parallelized. There are two particular limitations of the described techniques: (i) that there are no cross-iteration dependencies in the loop other than through the indirect loop index variables; and (ii) that the loop index variables (either direct or indirect) are not redefined in the loop body.

FIELD OF THE INVENTION

The present invention relates to run-time parallelization of computer programs that have loops containing indirect loop index variables and embedded conditional statements.

BACKGROUND

A key aspect of parallel computing is the ability to exploit parallelism in one or more loops in computer programs. For loops that do not have cross-iteration dependencies, or where such dependencies are linear with respect to the loop index variables, various existing techniques can be used to achieve parallel processing. A suitable reference for such techniques is Wolfe, M., High Performance Compilers for Parallel Computing, Addison-Wesley, 1996, Chapters 1 and 7. Such techniques perform a static analysis of the loop at compile-time. The compiler suitably groups and schedules loop iterations in parallel batches without violating the original semantics of the loop.

There are, however, many cases in which static analysis of the loop is not possible. Compilers, in such cases, cannot attempt any parallelization of the loop before run-time.

As an example, consider the loop of Table 1 below, for which parallelization cannot be performed at compile-time.

TABLE 1

    do i = 1, n
      x[u(i)] = . . .
      . . .
      y[i] = x[r(i)] . . .
      . . .
    enddo

Specifically, until the indirect loop index variables u(i) and r(i) are known, loop parallelization cannot be attempted for the loop of Table 1.

For a review of run-time parallelization techniques, refer to Rauchwerger, L., Run-Time Parallelization: Its Time Has Come, Journal of Parallel Computing, Special Issue on Language and Compilers, Vol. 24, Nos. 3-4, 1998, pp. 527-556. A preprint of this reference is available via the World Wide Web at the address www.cs.tamu.edu/faculty/rwerger/pubs.

Further difficulties, not discussed by Wolfe or Rauchwerger, arise when the loop body contains one or more conditional statements whose evaluation is possible only during run-time. As an example, consider the loop of Table 2 below, for which parallelization cannot be attempted by a compiler.

TABLE 2

    do i = 1, n
      x[u(i)] = . . .
      . . .
      if (cond) then y[i] = x[r(i)] . . .
      else y[i] = x[s(i)] . . .
      . . .
    enddo

The values of r(i) and s(i) in the loop of Table 2 above, as well as those of the indirect loop index variable u(i), must be known before loop parallelization can be attempted. Further, in each iteration, the value of cond must be known to decide whether r(i) or s(i) should be included in a particular iteration.

Further advances in loop parallelization are clearly needed in view of these and other observations.

SUMMARY

A determination is made whether a particular loop in a computer program can be parallelized. If parallelization is possible, a suitable strategy for parallelization is provided. The techniques described herein are suitable for loops in which:

-   (i) there are any finite number of array variables in the loop body, such as x and y in the example of Table 2 above;
-   (ii) there are any finite number of indirect loop index variables, such as u, r, and s in the example of Table 2 above;
-   (iii) each element of each array variable and of each indirect loop index variable is uniquely identifiable by a direct loop index variable, such as i in the example of Table 2 above;
-   (iv) the loop index variables (either direct or indirect variables) are not redefined within the loop; and
-   (v) there are no cross-iteration dependencies in the loop other than through the indirect loop index variables.

Parallelization is attempted at run-time for loops, as noted above, having indirect loop index variables and embedded conditional statements in the loop body. A set of active array variables and a set of indirect loop index variables are determined for the loop under consideration. Respective ranges of the direct loop index values and indirect loop index values are determined. Indirect loop index values are determined for each iteration, and each such value so determined is associated with a unique number. Based on these unique numbers, an indirectly indexed access pattern for each iteration in the loop is calculated.

Using the indirectly indexed access pattern, the loop iterations are grouped into a minimum number of waves such that the iterations comprising a wave have no cross-iteration dependencies among themselves. The waves are then scheduled in a predetermined sequence and the iterations in a wave are executed independently of each other in the presence of multiple computing processors.

DESCRIPTION OF DRAWINGS

FIG. 1 is a flow chart of steps involved in performing run-time parallelization of a loop that has indirect loop index variables and one embedded Boolean condition.

FIGS. 2A, 2B and 2C jointly form a flow chart of steps representing an algorithm for performing run-time parallelization.

FIG. 3 is a schematic representation of a computer system suitable for performing the run-time parallelization techniques described herein.

DETAILED DESCRIPTION

The following two brief examples are provided to illustrate cases in which an apparently unparallelizable loop can be parallelized by modifying the code, but not its semantics. Table 3 below provides a first brief example.

TABLE 3

    b = b0
    do i = 1, n
      x[u(i)] = b
      b = b+1
    enddo

The loop of Table 3 above cannot be parallelized as written, since the value of b used in each iteration is carried over from the previous iteration and so depends on the iteration count i. For example, for the 3rd iteration, x[u(3)]=b0+2, where b0 is the value of b just prior to entering the loop. The loop can, however, be parallelized if the loop is rewritten as shown in Table 4 below.

TABLE 4

    b = b0
    do i = 1, n
      x[u(i)] = b0 + i − 1
    enddo
    b = b0 + n
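As a minimal sketch of the rewrite from Table 3 to Table 4, the following Python fragment compares the two forms. The function names, the padded 1-based index list u, and the sample values are illustrative assumptions; u is also assumed to hold distinct values, so that the iterations of the rewritten loop may be visited in any order.

```python
def loop_table3(x, u, b0):
    """Original form: b is carried from one iteration to the next."""
    x = list(x)
    b = b0
    for i in range(1, len(u)):          # u[0] is unused padding (1-based indexing)
        x[u[i]] = b
        b = b + 1
    return x, b

def loop_table4(x, u, b0, order=None):
    """Rewritten form: each iteration computes its value from i alone."""
    x = list(x)
    n = len(u) - 1
    for i in (order or range(1, n + 1)):
        x[u[i]] = b0 + i - 1
    return x, b0 + n

u = [None, 3, 1, 4, 2]                  # hypothetical indirect index values u(1..4)
x0 = [0] * 5
sequential = loop_table3(x0, u, b0=10)
reordered = loop_table4(x0, u, b0=10, order=[4, 2, 1, 3])
print(sequential == reordered)
```

Because the rewritten body no longer reads a value produced by an earlier iteration, visiting the iterations in a different order leaves the result unchanged in this illustration.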

Table 5 below provides a second brief example.

TABLE 5

    do i = 1, n
      c = x[u(i)]
      ... ... ...
    enddo

The loop of Table 5 above is parallelizable if the loop is rewritten as shown in Table 6 below.

TABLE 6

    do i = 1, n
      c = x[u(i)]
      ... ... ...
    enddo
    c = x[u(n)]

These and other existing rules that improve parallelization of loops can be invoked whenever applicable. The above-mentioned references of Wolfe and Rauchwerger are suitable references for further such rules that can be adopted as required. The above-referenced content of these references is incorporated herein by reference.

Loop Parallelization Procedure

The loop parallelization procedure is described in greater detail with reference to the example of Table 7 below.

TABLE 7

    do i = 5, 15
      x1[r(i)] = s1[u(i)]
      x2[t(i)] = s2[r(i)] * s1[t(i)] . . .
      x3[u(i)] = x1[r(i)]/x3[u(i)]
      if (x2[t(i)]) then x4[v(i)] = s2[r(i)] + x5[t(i)] . . .
      else x3[v(i)] = x5[w(i)]
      x5[u(i)] = x3[v(i)] + x4[v(i)] . . .
      x6[u(i)] = x6[u(i)] − . . .
      x7[v(i)] = x7[v(i)] + x1[r(i)] − s1[u(i)]
      . . .
    enddo

In some cases, the analysis of cross-iteration dependencies is simplified if an array element that appears on the right hand side of an assignment statement is replaced by the most recent expression defining that element, if the expression exists in a statement prior to this assignment statement. In the example of Table 7 above, x1[r(i)] is such an element whose appearance on the right hand side of assignment statements for x3[u(i)] and x7[v(i)] can be replaced by s1[u(i)] since there is an earlier assignment statement x1[r(i)]=s1[u(i)].

Thus, the code fragment of Table 8 below represents the example of Table 7 above after the appropriate replacements are performed.

TABLE 8

    do i = 5, 15
      x1[r(i)] = s1[u(i)]                      // Defines x1[r(i)]
      x2[t(i)] = s2[r(i)] * s1[t(i)] . . .
      x3[u(i)] = (s1[u(i)])/x3[u(i)]           // Replaces x1[r(i)]
      if (x2[t(i)]) then x4[v(i)] = s2[r(i)] + x5[t(i)] . . .
      else x3[v(i)] = x5[w(i)]
      x5[u(i)] = x3[v(i)] + x4[v(i)] . . .
      x6[u(i)] = x6[u(i)] − . . .
      x7[v(i)] = x7[v(i)]                      // Identity after replacing x1[r(i)]
      . . .
    enddo

Further simplification of the code fragment of Table 8 above is possible if statements that are identities, or become identities after the replacement operations, are deleted. Finally, if the array variable x1 is a temporary variable that is not used after the loop is completely executed, then the assignment statement defining this variable (the first statement in the loop body of Table 8 above, x1[r(i)] = s1[u(i)]) is deleted without any semantic loss, consequently producing the corresponding code fragment of Table 9 below.

TABLE 9

    do i = 5, 15
      x2[t(i)] = s2[r(i)] * s1[t(i)] . . .
      x3[u(i)] = (s1[u(i)])/x3[u(i)]
      if (x2[t(i)]) then x4[v(i)] = s2[r(i)] + x5[t(i)] . . .
      else x3[v(i)] = x5[w(i)]
      x5[u(i)] = x3[v(i)] + x4[v(i)] . . .
      x6[u(i)] = x6[u(i)] − . . .
      . . .
    enddo

The array element replacement operations described above, which lead to the code fragment of Table 9 above, can be performed in source code, using character string “find and replace” operations. To ensure semantic correctness, the replacement string is enclosed in parentheses, as is done in Table 8 for the example of Table 7. To determine if an assignment statement expresses an identity, or to simplify the assignment statement, one may use any suitable technique. One reference describing suitable techniques is commonly assigned U.S. patent application Ser. No 09/597,478, filed Jun. 20, 2000, naming as inventor Rajendra K Bera and entitled “Determining the equivalence of two algebraic expressions”. The content of this reference is hereby incorporated by reference.
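The “find and replace” step can be sketched in Python as below. This is only an illustration under simple assumptions: the helper name substitute is hypothetical, a statement is assumed to contain a single assignment, and the second statement (with names y, a, b, c) is invented to show why the parentheses matter.

```python
def substitute(statement, element, definition):
    """Replace `element` on the right-hand side with its parenthesized definition."""
    lhs, rhs = statement.split("=", 1)
    return lhs + "=" + rhs.replace(element, "(" + definition + ")")

# Statement and definition taken from Table 7.
first = substitute("x3[u(i)] = x1[r(i)]/x3[u(i)]", "x1[r(i)]", "s1[u(i)]")
print(first)

# Hypothetical case in which the parentheses change the meaning:
# without them the result would read a/b + c rather than a/(b + c).
second = substitute("y[i] = a/x1[r(i)]", "x1[r(i)]", "b + c")
print(second)
```

Only the right hand side is searched, so the defining occurrence of an element on the left hand side is never rewritten.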

Potential advantages gained by the techniques described above are a reduced number of array variables for analysis, and a clearer indication of cross-iteration dependencies within a loop. Further, a few general observations can be made with reference to the example of Table 7 above.

First, non-conditional statements in the loop body that do not contain any array variables do not constrain parallelization, since an assumption is made that cross-iteration dependencies do not exist due to such statements. If such statements exist, however, a further assumption is made that these statements can be handled, so as to allow parallelization.

Secondly, only array variables that are defined (that is, appear on the left hand side of an assignment statement) in the loop body affect parallelization. In the case of Table 9 above, the set of such variables, referred to as active array variables, is {x2, x3, x4, x5, x6} when the condition part in the statement if (x2[t(i)]) evaluates to true and {x2, x3, x5, x6} when this statement evaluates to false.

If, for a loop, every possible set of active array variables is empty, then that loop is completely parallelizable.

Detection of the variables that affect loop parallelization requires only static analysis, and can therefore be performed by the compiler. Thus, respective lists of array variables that affect parallelization for each loop in the computer program can be provided by the compiler to the run-time system.

In the subsequent analysis, only indirect loop index variables associated with active array variables are considered. In the example of Table 9 above, these indirect loop index variables are {t, u, v} when the statement if (x2[t(i)]) evaluates to true and {t, u, v, w} when this statement evaluates to false.

Let V≡{v₁, v₂, . . . v_(n)} be the set of all active array variables that appear in the loop body, V_(T) be the subset of V that contains only those active array variables that are active when the Boolean condition evaluates to true, and V_(F) be the subset of V that contains only those active array variables that are active when the Boolean condition evaluates to false. Furthermore, let I≡{i₁, i₂, . . . i_(r)} be the set of indirect loop index variables that is associated with the active array variables in V, I_(T) be the set of indirect loop index variables that is associated with the active array variables in V_(T), and I_(F) be the set of indirect loop index variables that is associated with the active array variables in V_(F). Note that V≡V_(T)∪V_(F), I≡I_(T)∪I_(F), and the active array variables in V_(T)∩V_(F) are active in the loop body, independent of how the Boolean condition evaluates.

In the example of Table 9, these sets are as follows.

-   V={x2, x3, x4, x5, x6}
-   V_(T)={x2, x3, x4, x5, x6}
-   V_(F)={x2, x3, x5, x6}
-   I={t, u, v, w}
-   I_(T)={t, u, v}
-   I_(F)={t, u, v, w}
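One way to derive these sets mechanically is sketched below in Python. The encoding of Table 9 as tuples and the helper active_sets are assumptions made for illustration, as is the reading that an index variable is “associated with” an active array variable whenever it subscripts that variable on either side of a statement along the chosen branch. The elided parts of Table 9 are omitted.

```python
# Each statement: (branch, defined element, elements read).
# branch is "both", "then" or "else"; an element is (array name, index variable).
TABLE_9 = [
    ("both", ("x2", "t"), [("s2", "r"), ("s1", "t")]),
    ("both", ("x3", "u"), [("s1", "u"), ("x3", "u")]),
    ("then", ("x4", "v"), [("s2", "r"), ("x5", "t")]),
    ("else", ("x3", "v"), [("x5", "w")]),
    ("both", ("x5", "u"), [("x3", "v"), ("x4", "v")]),
    ("both", ("x6", "u"), [("x6", "u")]),
]

def active_sets(statements, branch):
    """Return (active arrays, their index variables) along one branch."""
    path = [s for s in statements if s[0] in ("both", branch)]
    active = {lhs[0] for _, lhs, _ in path}          # arrays that are assigned
    indices = {idx for _, lhs, reads in path
               for name, idx in [lhs, *reads] if name in active}
    return active, indices

V_T, I_T = active_sets(TABLE_9, "then")
V_F, I_F = active_sets(TABLE_9, "else")
V, I = V_T | V_F, I_T | I_F
print(sorted(V_T), sorted(I_T))
print(sorted(V_F), sorted(I_F))
```

Under this reading, r never enters the index sets because it subscripts only s2, which is not assigned in the loop body.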

Let the values of loop index i range from N₁ to N₂, and those of i₁, i₂, . . . i_(r) range at most from M₁ to M₂.

In the kth iteration (that is, i=N₁+k−1), the indirect loop index variables have values given by i₁(i), i₂(i), . . . i_(r)(i), and each such value is in the range [M₁, M₂]. To facilitate the description of further calculation steps, a different prime number p(l) is associated with each number l in the range [M₁, M₂]. The role of these prime numbers is explained in further detail below.
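A minimal Python sketch of this association follows. The function name assign_primes and the choice to start the search at 3 are assumptions; starting at 3 merely reproduces the values used in the examples later in this description, and any other set of distinct primes would serve.

```python
from itertools import count

def assign_primes(m1, m2, first=3):
    """Map each l in [m1, m2] to a distinct prime, searching upward from `first`."""
    primes = []
    candidate = count(first)
    while len(primes) < m2 - m1 + 1:
        c = next(candidate)
        # trial division is enough for the small ranges used here
        if c > 1 and all(c % q for q in range(2, int(c ** 0.5) + 1)):
            primes.append(c)
    return {l: q for l, q in zip(range(m1, m2 + 1), primes)}

p = assign_primes(1, 4)
print(p)
```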

The parallelization algorithm proceeds according to the steps listed below.

-   1. Create the arrays S_(A), S_(T) and S_(F) whose respective ith element is given as follows.
    -   S_(A)(i) = Π_(q∈I) p(q(i))
    -   S_(T)(i) = Π_(q∈I_(T)) p(q(i))
    -   S_(F)(i) = Π_(q∈I_(F)) p(q(i))
    -   These array elements are collectively referred to as the indirectly indexed access pattern for iteration i. The use of prime numbers in place of the indirect loop index values allows a group of such index values to be represented by a unique number. Thus S_(α)(i)=S_(β)(j), where α, β∈{A, T, F}, if and only if S_(α)(i) and S_(β)(j) each contain the same mix of prime numbers. This property follows from the fundamental theorem of arithmetic, which states that every whole number greater than one can be written as a product of prime numbers. Apart from the order of these prime number factors, there is only one such way to represent each whole number as a product of prime numbers. Note that one is not a prime number, and that two is the only even number that is a prime.
    -   Consequently, if the greatest common divisor (GCD) of S_(α)(i) and S_(β)(j) is equal to one, there are no common prime numbers between S_(α)(i) and S_(β)(j), and therefore, no common index values between the ith (α-branch) and the jth (β-branch) iterations. On the other hand, a greatest common divisor greater than one implies that there is at least one common prime number between S_(α)(i) and S_(β)(j) and, consequently, at least one common index value between the ith (α-branch) and the jth (β-branch) iterations.
    -   The significance of the above result is that if the greatest common divisor of S_(α)(i) and S_(β)(j) is equal to one then cross-iteration dependencies do not exist between the ith (α-branch) and the jth (β-branch) iterations.
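The prime-product encoding and the GCD test of step 1 can be illustrated with a short Python sketch. The helper names access_pattern and may_interact, the prime table p, and the three sets of index values are assumptions chosen for illustration only.

```python
from math import gcd, prod

p = {1: 3, 2: 5, 3: 7, 4: 11}            # assumed prime assigned to each index value

def access_pattern(index_values):
    """Product of the primes of the indirect index values used in one iteration."""
    return prod(p[v] for v in index_values)

# Hypothetical index values read by three iterations of some loop.
s_a = access_pattern([1, 1, 2])           # 3 * 3 * 5
s_b = access_pattern([3, 4])              # 7 * 11
s_c = access_pattern([2, 4, 4])           # 5 * 11 * 11

def may_interact(s_i, s_j):
    """True when the two patterns share a prime, that is, an index value."""
    return gcd(s_i, s_j) > 1

print(may_interact(s_a, s_b), may_interact(s_a, s_c), may_interact(s_b, s_c))
```

A repeated index value only raises the power of its prime, so it does not affect whether two patterns share a factor.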

-   2. Set k=1. Let R₁ be the set of values of the loop index i (which may range in value from N₁ to N₂), for which the loop can be run in parallel in the first “wave”. Let N≡{N₁, N₁+1, N₁+2, . . . , N₂}. The loop index values that belong to R₁ are determined as described by the pseudocode provided in Table 10 below.

TABLE 10

    Initialize R₁ = {N₁}.
    do j = N₁, N₂
      if (C cannot be evaluated now) S(j) = S_(A)(j)
      else {
        if (C) S(j) = S_(T)(j)
        else S(j) = S_(F)(j)
      }
      if (j = N₁) continue;
      do i = N₁, j−1
        drop_j = GCD(S(i), S(j)) − 1
        if (drop_j > 0) break    // Indicates that i, j iterations interact.
      enddo
      if (drop_j = 0) R₁ ← R₁ ∪ {j}
    enddo

    -   Following from the pseudocode of Table 10, if R₁≠N, go to step 3, or else go to step 4.
    -   The intent of the first part of the outer loop in Table 10 is to check whether the condition in the program loop, represented by C in the statement “if (C) . . . ”, can be evaluated before the iteration is executed. For example, a condition appearing in a program loop, such as C≡t(i)−2>0, where t(i) is an indirect loop index variable, can be evaluated without any of the program loop iterations being executed since the entire t array is known before the loop is entered. On the other hand, a condition such as C≡x2[t(i)]!=0 can be evaluated only if x2[t(i)] has not been modified by any previous iteration. If the condition C cannot be evaluated before the program loop iteration is executed, then one cannot a priori decide which indirect index variables are actually used during execution and therefore all the indirect index variables in I must be included in the analysis. When the condition C can be evaluated before the program loop iteration is executed, then one of I_(T) or I_(F), as found applicable, is chosen.

-   3. Set k←k+1 for the kth “wave” of parallel computations. Save the loop index values of the kth wave in R_(k). To determine the values saved in R_(k), proceed as described by the pseudocode provided in Table 11 below.

TABLE 11

    Initialize R_(k) = {l}, where l is the smallest index in the set N − (R₁ ∪ R₂ ∪ ... ∪ R_(k−1))
    do j = l, N₂
      if (j ∈ R₁ ∪ R₂ ∪ ... ∪ R_(k−1)) continue
      if (C cannot be evaluated now) S(j) = S_(A)(j)
      else {
        if (C) S(j) = S_(T)(j)
        else S(j) = S_(F)(j)
      }
      if (j = l) continue
      do i = l, j−1
        if (i ∈ R₁ ∪ R₂ ∪ ... ∪ R_(k−1)) continue
        drop_j = GCD(S(i), S(j)) − 1
        if (drop_j > 0) break
      enddo
      if (drop_j = 0) R_(k) ← R_(k) ∪ {j}
    enddo

    -   Following from the pseudocode of Table 11 above, if R₁∪R₂∪ . . . ∪R_(k)≠N, repeat step 3, or else go to step 4.
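Steps 2 and 3 can be condensed into the following Python sketch. It is a simplification under stated assumptions rather than a transcription of Tables 10 and 11: the names choose_pattern and group_into_waves are hypothetical, and S(j) is fixed once for every j before grouping, whereas the tables re-examine in each wave whether C can be evaluated. The sample data are the S_(T) and S_(F) values and the t(i) values of Example 1 (Tables 13 and 14), where S_(A) equals S_(T) and the condition t(i) > 2 can always be evaluated in advance.

```python
from math import gcd

def choose_pattern(j, evaluable, cond, S_A, S_T, S_F):
    """Pick S(j): S_A if C cannot be evaluated yet, else S_T or S_F."""
    if not evaluable(j):
        return S_A[j]
    return S_T[j] if cond(j) else S_F[j]

def group_into_waves(indices, S):
    """Group ordered loop index values into waves R_1, R_2, ..."""
    remaining = list(indices)
    waves = []
    while remaining:
        wave = [remaining[0]]                        # l always starts the wave
        for pos in range(1, len(remaining)):
            j = remaining[pos]
            # j joins only if it shares no prime with any earlier iteration
            # that is not yet scheduled (the inner i loop of the tables)
            if all(gcd(S[i], S[j]) == 1 for i in remaining[:pos]):
                wave.append(j)
        waves.append(wave)
        remaining = [j for j in remaining if j not in wave]
    return waves

t = {5: 1, 6: 2, 7: 2, 8: 1, 9: 4}
S_T = {5: 81, 6: 625, 7: 1225, 8: 1089, 9: 1089}
S_F = {5: 27, 6: 125, 7: 175, 8: 363, 9: 363}
S = {j: choose_pattern(j, lambda j: True, lambda j: t[j] > 2, S_T, S_T, S_F)
     for j in t}
waves = group_into_waves(sorted(S), S)
print(waves)
```

Comparing j against every earlier unscheduled iteration, and not only against members of the current wave, keeps an iteration from overtaking an earlier one with which it interacts.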

-   4. All loop index values saved in a given R_(k) can be run in the kth “wave”. Let n_(k) be the number of loop index values (n_(k) is the number of iterations) saved in R_(k). Let n_(p) be the number of available processors over which the iterations can be distributed for parallel execution of the loop. The iterations can be scheduled in many ways, especially if the processors are not all of the same type (for example, in terms of speed, etc). A simple schedule is as follows: Each of the first n_(l)=n_(k) mod n_(p) processors is assigned successive blocks of (n_(k)/n_(p)+1) iterations, and the remaining processors are assigned n_(k)/n_(p) iterations, where the division is integer division.

-   5. The “waves” are executed one after the other, in sequence,    subject to the condition that the next wave cannot commence    execution until the previous “wave” completely executes. This is    referred to as the wave synchronization criterion.
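The simple schedule of step 4 can be written as the following Python sketch. The function name block_schedule and the sample wave are assumptions, and the division is taken to be integer division so that the block sizes add up to n_(k).

```python
def block_schedule(wave, n_p):
    """Split one wave's loop index values into n_p successive blocks."""
    n_k = len(wave)
    base, extra = divmod(n_k, n_p)       # extra plays the role of n_l = n_k mod n_p
    blocks, start = [], 0
    for proc in range(n_p):
        size = base + 1 if proc < extra else base
        blocks.append(wave[start:start + size])
        start += size
    return blocks

# Hypothetical wave of seven iterations distributed over three processors.
blocks = block_schedule([10, 11, 12, 13, 14, 15, 16], n_p=3)
print(blocks)
```

When a wave holds fewer iterations than there are processors, the trailing blocks are simply empty.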

In relation to the above described procedure of steps 1 to 5, the following observations are made.

-   (a) In step 1, the S_(α)(i)s, that is, S_(A)(i), S_(T)(i), and S_(F)(i), can be calculated in parallel.
-   (b) The GCDs of S(i) and S(j) are calculated for j=N₁+1 to N₂ and for i=N₁ to j−1. The calculations can be performed in parallel since each GCD can be calculated independently.
-   (c) A possible way of parallelizing steps 2 and 3 is to dedicate one processor to these calculations. Let this particular processor calculate R₁. When R₁ is calculated, other processors start calculating the loop iterations according to R₁, while the particular processor starts calculating R₂. When R₂ is calculated and the loop iterations according to R₁ are completed, the other processors start calculating the loop iterations according to R₂, while the same particular processor starts calculating R₃, and so on.

Procedural Overview

Before providing example applications of the described techniques, an overview of these described techniques is now provided with reference to FIG. 1. FIG. 1 is a flow chart of steps involved in performing the described techniques. A set of active array variables and a set of indirect loop index variables are determined for the loop under consideration in step 110. Respective direct loop index values and indirect loop index values are determined in step 120.

Indirect loop index values i₁(i), i₂(i), . . . , i_(r)(i) are determined for each iteration, in step 130. Each such value so determined in step 130 is associated with a unique prime number in step 140. For each iteration, an array of values is then calculated that represents an indirectly indexed access pattern for that iteration, in step 150.

A grouping of iterations into a minimum number of waves is made in step 160, such that the iterations comprising a wave are executable in parallel.

Finally, in step 170, the waves are sequentially scheduled in an orderly fashion to allow their respective iterations to execute in parallel.

FIGS. 2A, 2B and 2C present a flow chart of steps that outline, in greater detail, steps involved in performing run-time parallelization as described above. The flow charts are easy to understand if reference is made to Table 10 for FIG. 2A and to Table 11 for FIGS. 2B and 2C. Initially, in step 202, active variables, V, V_(T), V_(F) and their corresponding loop index variables I, I_(T), I_(F) are identified in the loop body. In this notation, the set V is assigned as the union of sets V_(T) and V_(F), and set I is assigned as the union of sets I_(T) and I_(F). Values for N₁, N₂, and M₁ and M₂ are determined, and prime numbers p(l) are assigned to each value of l in the inclusive range defined by [M₁, M₂].

Next, in step 204, arrays are created as defined in Equation [1] below.

    S_(A)(i) = Π_(q∈I) p(q(i))
    S_(T)(i) = Π_(q∈I_(T)) p(q(i))
    S_(F)(i) = Π_(q∈I_(F)) p(q(i))        [1]

Also, k is assigned as 1, the set R₁ is assigned as {N₁}, and j is assigned as N₁. A determination is then made in step 206 whether the Boolean condition C can be evaluated. If C cannot be evaluated now, S(j) is assigned as S_(A)(j) in step 208. Otherwise, if C can be evaluated now, a determination is made in step 210 whether C is true or false.

If C is true, S(j) is assigned as S_(T)(j) in step 212. Otherwise, S(j) is assigned as S_(F)(j) in step 214. After performing steps 208, 212 or 214, a determination is made in step 216 of whether j is equal to N₁, that is, whether there has been a change in j following step 204.

If j has changed, then i is assigned as N₁ in step 218, and drop_j is assigned as the greatest common divisor of S(i) and S(j) less one in step 220. A determination of whether drop_j is greater than 0 is made in step 222. If drop_j is not greater than 0, then i is incremented by one in step 224, and a determination is made of whether i is equal to j in step 226.

If i is not equal to j in step 226, then processing returns to step 220, in which drop_j is assigned to be the greatest common divisor of S(i) and S(j) less one. Processing then proceeds to step 222, as described directly above. If i is equal to j in step 226, then processing proceeds directly to step 228.

If drop_j is greater than 0 in step 222, or if i equals j in step 226, then a determination is made in step 228 of whether drop_j is equal to 0. If drop_j is equal to 0, the set R₁ is augmented with the set {j} by a set union operation. The variable j is then incremented by 1 in step 232. If drop_j is not equal to 0 in step 228, then processing proceeds directly to step 232 in which the value of j is incremented by 1.

Once j is incremented in step 232, a determination is made in step 234 of whether the value of j is greater than the value of N₂. If j is not greater than N₂, then processing returns to step 206 to determine whether C can be evaluated, as described above. Otherwise, if j is greater than N₂, a determination is made of whether R₁ is equal to N in step 236. If R₁ is not equal to N in step 236, then processing proceeds to step 238: the value of k is incremented by one, and R_(k) is assigned as {l}, where l is the smallest index in the set N less the set formed by the union of sets R₁ through to R_(k−1). Also, j is assigned to be equal to l.

After this step 238, a determination is made in step 240 of whether j is an element of the union of the sets R₁ through to R_(k−1). If j is such an element in step 240, then j is incremented by one in step 242. A determination is then made in step 244 of whether the value of j is less than or equal to the value of N₂. If j is indeed less than or equal to the value of N₂ in step 244, then processing returns to step 240. Otherwise, processing proceeds to step 278, as described below, if the value of j is determined to be greater than the value of N₂.

If in step 240, j is determined to be not such an element, then a determination is made in step 246 of whether the Boolean condition C can be evaluated. If C cannot be evaluated in step 246, then S(j) is assigned as S_(A)(j) in step 248.

If, however, C can be evaluated, then in step 250 a determination is made of whether C is true or false. If C is true, S(j) is assigned as S_(T)(j) in step 252, otherwise S(j) is assigned as S_(F)(j) in step 254.

After performing any of steps 248, 252, or 254 as described above, a determination is made in step 256 of whether j is equal to l, namely whether there has been a change in j following step 238.

If the value of j is not equal to l, then the value of i is assigned as l in step 258. Following step 258, a determination is made in step 260 of whether i is an element of the union of sets R₁ through to R_(k−1). If i is not an element, then drop_j is assigned to be the greatest common divisor of S(i) and S(j), less one, in step 262. Then a determination is made in step 264 of whether drop_j is greater than zero. If drop_j is not greater than zero, then the value of i is incremented by one in step 266. Then a determination is made in step 268 of whether the value of i is equal to the value of j. If the values of i and j are not equal in step 268, then processing returns to step 260 as described above.

If, however, the values of i and j are equal in step 268, then a determination is made in step 270 of whether drop_j is equal to zero. If drop_j is equal to zero in step 270, then the set R_(k) is augmented by the set {j} using a union operator in step 272. If drop_j is not equal to zero in step 270, then the value of j is incremented by one in step 274. The value of j is also incremented by one in step 274 directly after performing step 272, or after performing step 256, if the value of j is found to equal the value of l.

After incrementing the value of j in step 274, a determination is made in step 276 of whether the value of j is greater than the value of N₂. If the value of j is not greater than the value of N₂, then processing returns to step 240, as described above. Otherwise, if the value of j is greater than the value of N₂, then processing proceeds to step 278. Step 278 is also performed if the value of j is determined to be greater than N₂ in step 244, as described above.

In step 278, a determination is made of whether the set N is equal to the union of sets R₁ through to R_(k). If there is no equality between these two sets in step 278, then processing returns to step 238, as described above. Otherwise, if the two sets are determined to be equal in step 278, then step 280 is performed, in which the value of k is saved, and the value of i is assigned as a value of one. Step 280 is also performed following step 236, if set N is determined to equal set R₁.

Following step 280, a determination is made in step 282 of whether the value of i is greater than the value of k. If the value of i is greater than the value of k in step 282, then processing stops in step 286. Otherwise, if the value of i is less than or equal to the value of k in step 282, then step 284 is performed in which iterations are executed in parallel for loop index values that are saved in the set R_(i). The value of i is also incremented by one, and processing then returns to step 282 as described above.

EXAMPLE 1

A first example is described with reference to the code fragment of Table 12 below.

TABLE 12

    do i = 5, 9
      x1[t(i)] = x2[r(i)]
      if (t(i) > 2) x2[u(i)] = x1[v(i)]
      else x2[u(i)] = x1[t(i)]
    enddo

In Table 12 above, since x1 and x2 are the only active array variables, the indirect loop index variables r(i), t(i), u(i), v(i) associated with these variables are the only index variables that are considered. The values of r(i), t(i), u(i), v(i) are provided in Table 13 below.

TABLE 13

    Indirect index variable   i = 5   i = 6   i = 7   i = 8   i = 9
    r(i)                        1       2       3       4       4
    t(i)                        1       2       2       1       4
    u(i)                        1       2       2       4       1
    v(i)                        1       2       3       1       1

By inspection, M₁=1, M₂=4, and N₁=5, N₂=9. A unique prime number is associated with each of the values 1, 2, 3, 4 that one or more of the indirect index variables can attain: p(1)=3, p(2)=5, p(3)=7, p(4)=11.

The pseudocode in Table 14 below illustrates the operations that are performed with reference to steps 1 to 5 described above in the subsection entitled “Loop parallelization procedure”.

TABLE 14

    Step 1
      S_(A)(i) = S_(T)(i) = p(r(i)) × p(t(i)) × p(u(i)) × p(v(i)) for i = 5, 6, 7, 8, 9.
        S_(A)(5) = S_(T)(5) = p(1) × p(1) × p(1) × p(1) = 3 × 3 × 3 × 3 = 81
        S_(A)(6) = S_(T)(6) = p(2) × p(2) × p(2) × p(2) = 5 × 5 × 5 × 5 = 625
        S_(A)(7) = S_(T)(7) = p(3) × p(2) × p(2) × p(3) = 7 × 5 × 5 × 7 = 1225
        S_(A)(8) = S_(T)(8) = p(4) × p(1) × p(4) × p(1) = 11 × 3 × 11 × 3 = 1089
        S_(A)(9) = S_(T)(9) = p(4) × p(4) × p(1) × p(1) = 11 × 11 × 3 × 3 = 1089
      S_(F)(i) = p(r(i)) × p(t(i)) × p(u(i)) for i = 5, 6, 7, 8, 9.
        S_(F)(5) = p(1) × p(1) × p(1) = 3 × 3 × 3 = 27
        S_(F)(6) = p(2) × p(2) × p(2) = 5 × 5 × 5 = 125
        S_(F)(7) = p(3) × p(2) × p(2) = 7 × 5 × 5 = 175
        S_(F)(8) = p(4) × p(1) × p(4) = 11 × 3 × 11 = 363
        S_(F)(9) = p(4) × p(4) × p(1) = 11 × 11 × 3 = 363

    Step 2
      Set k = 1, R₁ = {5}.
      j = 5: if cond = FALSE; S(5) = S_(F)(5) = 27;
      j = 6: if cond = FALSE; S(6) = S_(F)(6) = 125;
        i = 5: GCD(27, 125) = 1;
        R₁ = {5, 6}
      j = 7: if cond = FALSE; S(7) = S_(F)(7) = 175;
        i = 5: GCD(27, 175) = 1;
        i = 6: GCD(125, 175) ≠ 1; terminate loop
        R₁ = {5, 6}
      j = 8: if cond = FALSE; S(8) = S_(F)(8) = 363;
        i = 5: GCD(27, 363) ≠ 1; terminate loop
        R₁ = {5, 6}
      j = 9: if cond = TRUE; S(9) = S_(T)(9) = 1089;
        i = 5: GCD(27, 1089) ≠ 1; terminate loop
        R₁ = {5, 6}
      Since R₁ ≠ N, go to step 3.

    Step 3
      Set k = 2, l = 7, R₂ = {7}.
      j = 7: j ∉ R₁; if cond = FALSE; S(7) = S_(F)(7) = 175;
      j = 8: j ∉ R₁; if cond = FALSE; S(8) = S_(F)(8) = 363;
        i = 7: i ∉ R₁; GCD(175, 363) = 1;
        R₂ = {7, 8}
      j = 9: j ∉ R₁; if cond = TRUE; S(9) = S_(T)(9) = 1089;
        i = 7: i ∉ R₁; GCD(175, 1089) = 1;
        i = 8: i ∉ R₁; GCD(363, 1089) ≠ 1; terminate loop
        R₂ = {7, 8}
      Since R₁ ∪ R₂ ≠ N, repeat step 3.
      Set k = 3, l = 9, R₃ = {9}.
      j = 9: j ∉ (R₁ ∪ R₂); if cond = TRUE; S(9) = S_(T)(9) = 1089;
        No further iterations.
        R₃ = {9}
      Since R₁ ∪ R₂ ∪ R₃ = N, go to step 4.

    Steps 4 and 5
      Execute as outlined in steps 4 and 5 in the subsection entitled “Loop parallelization procedure”.
      Notice that there are 5 iterations and 3 waves: R₁ = {5, 6}, R₂ = {7, 8}, R₃ = {9}.
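The wave schedule found for this example can be checked with the Python sketch below, which runs the loop of Table 12 once sequentially and once wave by wave, finishing each wave before the next one starts. The initial contents of x1 and x2, the function names, and the use of a thread pool as a stand-in for multiple processors are assumptions made for illustration; the index values are those of Table 13.

```python
from concurrent.futures import ThreadPoolExecutor

r = {5: 1, 6: 2, 7: 3, 8: 4, 9: 4}      # Table 13
t = {5: 1, 6: 2, 7: 2, 8: 1, 9: 4}
u = {5: 1, 6: 2, 7: 2, 8: 4, 9: 1}
v = {5: 1, 6: 2, 7: 3, 8: 1, 9: 1}

def body(i, x1, x2):
    """One iteration of the loop of Table 12."""
    x1[t[i]] = x2[r[i]]
    x2[u[i]] = x1[v[i]] if t[i] > 2 else x1[t[i]]

def run_sequential(x1, x2):
    for i in range(5, 10):
        body(i, x1, x2)
    return x1, x2

def run_in_waves(x1, x2, waves, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for wave in waves:
            # list() waits for the whole wave before the next one starts
            list(pool.map(lambda i: body(i, x1, x2), wave))
    return x1, x2

# Assumed starting data; slot 0 is unused so that indices 1..4 match the tables.
start1, start2 = [0, 10, 20, 30, 40], [0, 100, 200, 300, 400]
seq = run_sequential(list(start1), list(start2))
par = run_in_waves(list(start1), list(start2), [[5, 6], [7, 8], [9]])
print(seq == par)
```

Draining the result of pool.map before moving on is what enforces the wave synchronization criterion in this sketch.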

EXAMPLE 2

A second example is described with reference to the code fragment of Table 15 below.

TABLE 15

    do i = 5, 9
      x1[t(i)] = x2[r(i)] + . . .
      if (x1[t(i)] > 0) x2[u(i)] = x1[v(i)] + . . .
      else x2[u(i)] = x1[t(i)] + . . .
    enddo

In the example of Table 15 above, since x1, x2 are the only active array variables, the indirect loop index variables r(i), t(i), u(i), v(i) associated with these variables are the index variables that are considered for parallelization. Values of r(i), t(i), u(i), v(i) are tabulated in Table 16 below.

TABLE 16

    Indirect index variable   i = 5   i = 6   i = 7   i = 8   i = 9
    r(i)                        1       2       3       4       4
    t(i)                        1       2       2       1       4
    u(i)                        1       2       2       4       1
    v(i)                        1       2       3       3       1

By inspection, M₁=1, M₂=4, and N₁=5, N₂=9. A unique prime number is associated with each of the values 1, 2, 3, 4 that one or more of the indirect index variables attains: p(1)=3, p(2)=5, p(3)=7, p(4)=11. That is, p( ) simply provides consecutive prime numbers, though any alternative sequence of prime numbers can also be used.

The pseudocode in Table 17 below illustrates the operations that are performed with reference to steps 1 to 5 described above in the subsection entitled “Loop parallelization procedure”.

TABLE 17

Step 1
  S_(A)(i) = S_(T)(i) = p(r(i)) × p(t(i)) × p(u(i)) × p(v(i)) for i = 5, 6, 7, 8, 9.
    S_(A)(5) = S_(T)(5) = p(1) × p(1) × p(1) × p(1) = 3 × 3 × 3 × 3 = 81
    S_(A)(6) = S_(T)(6) = p(2) × p(2) × p(2) × p(2) = 5 × 5 × 5 × 5 = 625
    S_(A)(7) = S_(T)(7) = p(3) × p(2) × p(2) × p(3) = 7 × 5 × 5 × 7 = 1225
    S_(A)(8) = S_(T)(8) = p(4) × p(1) × p(4) × p(3) = 11 × 3 × 11 × 7 = 2541
    S_(A)(9) = S_(T)(9) = p(4) × p(4) × p(1) × p(1) = 11 × 11 × 3 × 3 = 1089
  S_(F)(i) = p(r(i)) × p(t(i)) × p(u(i)) for i = 5, 6, 7, 8, 9.
    S_(F)(5) = p(1) × p(1) × p(1) = 3 × 3 × 3 = 27
    S_(F)(6) = p(2) × p(2) × p(2) = 5 × 5 × 5 = 125
    S_(F)(7) = p(3) × p(2) × p(2) = 7 × 5 × 5 = 175
    S_(F)(8) = p(4) × p(1) × p(4) = 11 × 3 × 11 = 363
    S_(F)(9) = p(4) × p(4) × p(1) = 11 × 11 × 3 = 363

Step 2
  Set k = 1, R₁ = {5}.
  j = 5: if cond cannot be evaluated; S(5) = S_(A)(5) = 81;
  j = 6: if cond cannot be evaluated; S(6) = S_(A)(6) = 625;
    i = 5: GCD(81, 625) = 1;
    R₁ = {5, 6}
  j = 7: if cond cannot be evaluated; S(7) = S_(A)(7) = 1225;
    i = 5: GCD(81, 1225) = 1;
    i = 6: GCD(625, 1225) ≠ 1; terminate loop
    R₁ = {5, 6}
  j = 8: if cond cannot be evaluated; S(8) = S_(A)(8) = 2541;
    i = 5: GCD(81, 2541) ≠ 1; terminate loop
    R₁ = {5, 6}
  j = 9: if cond cannot be evaluated; S(9) = S_(A)(9) = 1089;
    i = 5: GCD(81, 1089) ≠ 1; terminate loop
    R₁ = {5, 6}
  Since R₁ ≠ N, go to step 3.

Step 3
  Set k = 2, l = 7, R₂ = {7}.
  j = 7: j ∉ R₁; if cond cannot be evaluated; S(7) = S_(A)(7) = 1225;
  j = 8: j ∉ R₁; if cond cannot be evaluated; S(8) = S_(A)(8) = 2541;
    i = 7: i ∉ R₁; GCD(1225, 2541) ≠ 1; terminate loop
    R₂ = {7}
  j = 9: j ∉ R₁; if cond cannot be evaluated; S(9) = S_(A)(9) = 1089;
    i = 7: i ∉ R₁; GCD(1225, 1089) = 1;
    i = 8: i ∉ R₁; GCD(2541, 1089) ≠ 1; terminate loop
    R₂ = {7}
  Since R₁ ∪ R₂ ≠ N, repeat step 3.
  Set k = 3, l = 8, R₃ = {8}.
  j = 8: j ∉ (R₁ ∪ R₂); if cond cannot be evaluated; S(8) = S_(A)(8) = 2541;
  j = 9: j ∉ (R₁ ∪ R₂); if cond cannot be evaluated; S(9) = S_(A)(9) = 1089;
    i = 8: i ∉ (R₁ ∪ R₂); GCD(2541, 1089) ≠ 1; terminate loop
    R₃ = {8}
  Set k = 4, l = 9, R₄ = {9}.
  j = 9: j ∉ (R₁ ∪ R₂ ∪ R₃); if cond cannot be evaluated; S(9) = S_(A)(9) = 1089;
    No further iterations.
    R₄ = {9}
  Since R₁ ∪ R₂ ∪ R₃ ∪ R₄ = N, go to step 4.

Steps 4 and 5
  Execute as outlined in steps 4 and 5 in the subsection entitled “Loop parallelization procedure”.
  Notice that in this example there are 5 iterations and 4 waves: R₁ = {5, 6}, R₂ = {7}, R₃ = {8}, R₄ = {9}.
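
As a cross-check of Step 1 of Table 17, the products S_(A)(i) and S_(F)(i) can be recomputed from the index values of Table 16. The Python sketch below is illustrative only; the dictionary layout and the variable names (p, S_A, S_F, conflicts) are assumptions of this sketch, not part of the described procedure.

```python
from math import gcd, prod

p = {1: 3, 2: 5, 3: 7, 4: 11}             # primes assigned in the text
iters = [5, 6, 7, 8, 9]
# Table 16 index values, listed for i = 5..9.
r = dict(zip(iters, [1, 2, 3, 4, 4]))
t = dict(zip(iters, [1, 2, 2, 1, 4]))
u = dict(zip(iters, [1, 2, 2, 4, 1]))
v = dict(zip(iters, [1, 2, 3, 3, 1]))

# S_A uses all four index variables; S_F omits v (false branch of Table 15).
S_A = {i: prod(p[x[i]] for x in (r, t, u, v)) for i in iters}
S_F = {i: prod(p[x[i]] for x in (r, t, u)) for i in iters}

# Pairs (i, j) with i < j whose S_A values share a prime factor,
# i.e. pairs that may touch a common array element.
conflicts = [(i, j) for i in iters for j in iters
             if i < j and gcd(S_A[i], S_A[j]) != 1]

print(S_A)        # {5: 81, 6: 625, 7: 1225, 8: 2541, 9: 1089}
print(conflicts)  # [(5, 8), (5, 9), (6, 7), (7, 8), (8, 9)]
```

The conflicting pairs (6, 7), (7, 8) and (8, 9) are what force iterations 7, 8 and 9 into separate waves in this example.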

EXAMPLE 3

A third example is described with reference to the code fragment of Table 18 below.

TABLE 18
do i = 5, 9
  x1[t(i)] = x2[r(i)] + . . .
  if (x1[t(i)] > 0 || t(i) > 2) x2[u(i)] = x1[v(i)] + . . .
  else x2[u(i)] = x1[t(i)] + . . .
enddo

In the example of Table 18 above, since x1, x2 are the only active array variables, the indirect loop index variables r(i), t(i), u(i), v(i) associated with them are the index variables to be considered for parallelization.

Values of r(i), t(i), u(i), and v(i) are tabulated in Table 19 below.

TABLE 19
Indirect index variable    i = 5    i = 6    i = 7    i = 8    i = 9
r(i)                         1        2        3        4        4
t(i)                         1        2        3        1        4
u(i)                         1        2        2        4        1
v(i)                         1        2        3        3        1

By inspection, M₁=1, M₂=4, and N₁=5, N₂=9. A unique prime number is associated with each of the values 1, 2, 3, 4 that one or more of the indirect index variables attains: p(1)=3, p(2)=5, p(3)=7, p(4)=11.
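
The condition of Table 18 mixes an operand that depends on array data, x1[t(i)] > 0, with one that depends only on an index value, t(i) > 2. A minimal Python sketch of how such a condition might be classified before an iteration runs is given below. It models the unknown operand as None and uses a three-valued ‘or’; the helper names or3, cond_state and select_S are hypothetical, and treating x1[t(i)] > 0 as always unknown in advance is an assumption of this sketch.

```python
from math import prod

p = {1: 3, 2: 5, 3: 7, 4: 11}
iters = [5, 6, 7, 8, 9]
# Table 19 index values, listed for i = 5..9.
r = dict(zip(iters, [1, 2, 3, 4, 4]))
t = dict(zip(iters, [1, 2, 3, 1, 4]))
u = dict(zip(iters, [1, 2, 2, 4, 1]))
v = dict(zip(iters, [1, 2, 3, 3, 1]))

def or3(a, b):
    """Three-valued OR: True, False, or None when the result is unknown."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def cond_state(i):
    """State T, F or A of (x1[t(i)] > 0 || t(i) > 2) before iteration i runs."""
    unknown = None                     # x1[t(i)] is only written inside iteration i
    result = or3(unknown, t[i] > 2)
    return {True: "T", False: "F", None: "A"}[result]

def select_S(i):
    """Pick S(i) according to the state of the condition."""
    S_A = prod(p[x[i]] for x in (r, t, u, v))   # all index variables
    S_T = S_A                                    # true branch reads x1[v(i)]
    S_F = prod(p[x[i]] for x in (r, t, u))       # false branch omits v
    return {"A": S_A, "T": S_T, "F": S_F}[cond_state(i)]

states = {i: cond_state(i) for i in iters}
S = {i: select_S(i) for i in iters}
print(states)  # {5: 'A', 6: 'A', 7: 'T', 8: 'A', 9: 'T'}
print(S)       # {5: 81, 6: 625, 7: 1715, 8: 2541, 9: 1089}
```

Because the unknown operand can never make the ‘or’ false on its own, the state F does not arise for this particular condition.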

The pseudocode in Table 20 below illustrates the operations that are performed with reference to steps 1 to 5 described above in the subsection entitled “Loop parallelization procedure”.

TABLE 20

Step 1
  S_(A)(i) = S_(T)(i) = p(r(i)) × p(t(i)) × p(u(i)) × p(v(i)) for i = 5, 6, 7, 8, 9.
    S_(A)(5) = S_(T)(5) = p(1) × p(1) × p(1) × p(1) = 3 × 3 × 3 × 3 = 81
    S_(A)(6) = S_(T)(6) = p(2) × p(2) × p(2) × p(2) = 5 × 5 × 5 × 5 = 625
    S_(A)(7) = S_(T)(7) = p(3) × p(3) × p(2) × p(3) = 7 × 7 × 5 × 7 = 1715
    S_(A)(8) = S_(T)(8) = p(4) × p(1) × p(4) × p(3) = 11 × 3 × 11 × 7 = 2541
    S_(A)(9) = S_(T)(9) = p(4) × p(4) × p(1) × p(1) = 11 × 11 × 3 × 3 = 1089
  S_(F)(i) = p(r(i)) × p(t(i)) × p(u(i)) for i = 5, 6, 7, 8, 9.
    S_(F)(5) = p(1) × p(1) × p(1) = 3 × 3 × 3 = 27
    S_(F)(6) = p(2) × p(2) × p(2) = 5 × 5 × 5 = 125
    S_(F)(7) = p(3) × p(3) × p(2) = 7 × 7 × 5 = 245
    S_(F)(8) = p(4) × p(1) × p(4) = 11 × 3 × 11 = 363
    S_(F)(9) = p(4) × p(4) × p(1) = 11 × 11 × 3 = 363

Step 2
  Set k = 1, R₁ = {5}.
  j = 5: ‘if cond’ cannot be evaluated; S(5) = S_(A)(5) = 81;
    Comment: The ‘if cond’ cannot be evaluated because, although ‘t(i) > 2’ is false, the ‘or’ operator requires that x1[t(i)] also be evaluated to determine the ‘if cond’. If ‘t(i) > 2’ had turned out to be true, then evaluation of x1[t(i)] would not have been necessary in view of the ‘or’ operator.
  j = 6: ‘if cond’ cannot be evaluated; S(6) = S_(A)(6) = 625;
    i = 5: GCD(81, 625) = 1;
    R₁ = {5, 6}
  j = 7: ‘if cond’ = TRUE; S(7) = S_(T)(7) = 1715;
    Comment: The ‘if cond’ is true because ‘t(i) > 2’ is true. Therefore x1[t(i)] need not be evaluated in the presence of the ‘or’ operator.
    i = 5: GCD(81, 1715) = 1;
    i = 6: GCD(625, 1715) ≠ 1; terminate loop
    R₁ = {5, 6}
  j = 8: ‘if cond’ cannot be evaluated; S(8) = S_(A)(8) = 2541;
    i = 5: GCD(81, 2541) ≠ 1; terminate loop
    R₁ = {5, 6}
  j = 9: ‘if cond’ = TRUE; S(9) = S_(T)(9) = 1089;
    Comment: The ‘if cond’ is true because ‘t(i) > 2’ is true. Therefore x1[t(i)] need not be evaluated in the presence of the ‘or’ operator.
    i = 5: GCD(81, 1089) ≠ 1; terminate loop
    R₁ = {5, 6}
  Since R₁ ≠ N, go to step 3.

Step 3
  Set k = 2, l = 7, R₂ = {7}.
  j = 7: j ∉ R₁; ‘if cond’ = TRUE; S(7) = S_(T)(7) = 1715;
  j = 8: j ∉ R₁; ‘if cond’ cannot be evaluated; S(8) = S_(A)(8) = 2541;
    i = 7: i ∉ R₁; GCD(1715, 2541) ≠ 1; terminate loop
    R₂ = {7}
  j = 9: j ∉ R₁; ‘if cond’ = TRUE; S(9) = S_(T)(9) = 1089;
    i = 7: i ∉ R₁; GCD(1715, 1089) = 1;
    i = 8: i ∉ R₁; GCD(2541, 1089) ≠ 1; terminate loop
    R₂ = {7}
  Since R₁ ∪ R₂ ≠ N, repeat step 3.
  Set k = 3, l = 8, R₃ = {8}.
  j = 8: j ∉ (R₁ ∪ R₂); ‘if cond’ cannot be evaluated; S(8) = S_(A)(8) = 2541;
  j = 9: j ∉ (R₁ ∪ R₂); ‘if cond’ = TRUE; S(9) = S_(T)(9) = 1089;
    i = 8: i ∉ (R₁ ∪ R₂); GCD(2541, 1089) ≠ 1; terminate loop
    R₃ = {8}
  Set k = 4, l = 9, R₄ = {9}.
  j = 9: j ∉ (R₁ ∪ R₂ ∪ R₃); ‘if cond’ = TRUE; S(9) = S_(T)(9) = 1089;
    No further iterations.
    R₄ = {9}
  Since R₁ ∪ R₂ ∪ R₃ ∪ R₄ = N, go to step 4.

Steps 4 and 5
  Execute as outlined in steps 4 and 5 in the subsection entitled “Loop parallelization procedure”.
  Notice that in this example too there are 5 iterations and 4 waves: R₁ = {5, 6}, R₂ = {7}, R₃ = {8}, R₄ = {9}.

Case when no Conditional Statements are Present in the Loop

In this case put V=V_(A), I=I_(A), S=S_(A). Since there is no conditional statement C in the loop, the statement “if (C cannot be evaluated now) . . . ”, wherever it appears in the loop parallelization algorithm described above, is assumed to evaluate to “true”.

Extension of the Method to Include Multiple Boolean Conditions

Inclusion of more than one Boolean condition in a loop body increases the number of decision paths available in a loop, to a maximum of 3^(r), where r is the number of Boolean conditions. The factor 3 appears because each condition may have one of three states (true, false, or not decidable), even though the condition is Boolean. For each path λ, it is necessary to compute an S_(λ)(i) value for each iteration i. This is done by modifying the code fragment shown in Table 21, which appears in steps 2 and 3 of the “Loop parallelization procedure” described above.

TABLE 21
if (C is not decidable) S(j) = S_(A)(j)
else {
  if (C) S(j) = S_(T)(j)
  else S(j) = S_(F)(j)
}
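
The 3^(r) upper bound can be illustrated by enumerating every assignment of a state T, F or A to r condition identifiers. The Python sketch below is illustrative only; the function name state_assignments and the identifiers c1 and c2 are hypothetical. It gives the upper bound, not the actual set of paths, since a loop in which some conditions are nested inside branches of others has fewer reachable paths.

```python
from itertools import product

def state_assignments(cond_ids):
    """All assignments of a state T, F or A to each condition identifier."""
    return [dict(zip(cond_ids, states))
            for states in product("TFA", repeat=len(cond_ids))]

assignments = state_assignments(["c1", "c2"])
print(len(assignments))   # 9, i.e. 3**2
```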

The modification replaces the code fragment of Table 21 by

if (λ=path(i)) S(i)=S_(λ)(i)

where the function path(i) evaluates the Boolean conditions in the path and returns a path index λ. The enumeration of all possible paths, for each loop in a program, can be done by a compiler and the information provided to the run-time system in an appropriate format. Typically, each Boolean condition is provided with a unique identifier, which is then used in constructing the paths. When such an identifier appears in a path it is also tagged with one of three states, say, T (for true), F (for false), A (for not decidable, that is, carry all active array variables) as applicable for the path. A suggested path format is the following string representation

-   ident_1:x_1 ident_2:x_2 . . . ident_n:x_n;

where ident_i identifies a Boolean condition in a loop and x_i is one of its possible states T, F, or A. Finally, this string is appended with the list of indirect loop index variables that appear with the active variables in the path. A suggested format is

-   ident_1:x_1 ident_2:x_2 . . . ident_n:x_n; {I_(λ)}

where {I_(λ)} comprises the set of indirect loop index variables (any two variables being separated by a comma), and the construction of any of ident_n, x_n, or elements of the set {I_(λ)} does not use the delimiter characters ‘:’, ‘;’ or ‘,’. The left-to-right sequence in which the identifiers appear in a path string corresponds to the sequence in which the Boolean conditions will be encountered in the path at run-time. Let Q={q₁, q₂, . . . , q_(m)} be the set of m appended path strings found by a compiler. A typical appended path string q_(λ) in Q may appear as

-   q_(λ)≡id4:T id7:T id6:F id8:T; {u, r, t}

where the path portion represents the execution sequence wherein the Boolean condition with the identifier id4 evaluates to true, id7 evaluates to true, id6 evaluates to false, id8 evaluates to true, and the path has the indirect loop index variables {u, r, t} associated with its active variables.
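
The suggested string format lends itself to simple splitting on its delimiter characters. A minimal Python sketch of a parser for an appended path string follows; the function name parse_path_string is hypothetical, and the sketch assumes that the index variable set is always written inside braces after the ‘;’ character.

```python
def parse_path_string(s):
    """Split an appended path string into (conditions, index_variables).

    conditions is a list of (identifier, state) pairs in encounter order;
    index_variables is the list of names inside the trailing braces.
    """
    path_part, _, index_part = s.partition(";")
    conditions = []
    for token in path_part.split():
        ident, state = token.split(":")
        if state not in ("T", "F", "A"):
            raise ValueError(f"unknown state {state!r} in {token!r}")
        conditions.append((ident, state))
    names = index_part.strip().strip("{}")
    index_variables = [n.strip() for n in names.split(",") if n.strip()]
    return conditions, index_variables

conds, idx = parse_path_string("id4:T id7:T id6:F id8:T; {u, r, t}")
print(conds)  # [('id4', 'T'), ('id7', 'T'), ('id6', 'F'), ('id8', 'T')]
print(idx)    # ['u', 'r', 't']
```

The parser relies on the stated restriction that identifiers, states and variable names never contain ‘:’, ‘;’ or ‘,’.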

With the formatted set Q of all possible appended path strings available from a compiler, the run-time system then needs only to construct a path q for each iteration being considered in a wave, compare q with the paths in Q, and decide upon the parallelizing options available to it.

The simplest type of path the run-time system can construct is one for which each Boolean condition, in the sequence of Boolean conditions being evaluated in an iteration, evaluates to either true or false. In such a case, the exact path in the iteration is known. Let q be such a path, which in the suggested format appears as

-   q≡ident_1:x_1 ident_2:x_2 . . . ident_n:x_n;

A string match with the set of strings available in Q will show that q appears as a path in one and only one of the strings in Q, say q_(λ), since q was deliberately formatted to end with the character ‘;’, which does not appear in any other part of the string. The function path(i) will return the index λ on finding this match. The set of indirect loop index variables {I_(λ)} can be taken from the trailing part of q_(λ) for calculating S_(λ)(i).

When the run-time system, while constructing a path q, comes across a Boolean condition that evaluates to not decidable, it means that a definite path cannot be determined before executing the iteration. In such a case, the construction of the path is terminated at the undecidable Boolean condition, after encoding that condition and its state (A) into the path string. For example, if this undecidable Boolean condition has the identifier idr, then the path q terminates with the substring idr:A;. A variation of q is now constructed which is identical to q except that the character ‘;’ is replaced by the blank character ‘ ’. Let q′ be this variation. Every string in Q for which either q or q′ is an initial substring (meaning that q or q′ appears as a substring from the head of the string in Q that it matches) is a possible path for the iteration under consideration. (There will be more than one such path found in Q.) In such a case the path( ) function will return an illegal λ value (in this embodiment it is −1) and S_(λ)(i) is computed using the set of indirect index variables given by the union of all the indirect index variable sets that appear in the paths in Q for which either q or q′ was found to be an initial substring. Note that S_(−1)(i) does not have a unique value (unlike the other S_(λ)(i) values, which can be precalculated and saved) but must be calculated afresh every time path(i) returns −1.
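
The two matching cases described above (a fully decided path, and a path cut short at an undecidable condition) can be sketched in Python as follows. This is an illustration under stated assumptions: the function takes the already constructed string q rather than an iteration number, it returns the index variable set alongside λ, the index λ is counted from 1 to follow the numbering q₁, . . . , q_(m), and the set Q shown is hypothetical compiler output invented for the sketch.

```python
def path(q, Q):
    """Return (lambda, index_variables) for a run-time path string q.

    q ends with ';'. If q ends in an undecidable condition ('...:A;'),
    every string in Q that starts with q or with q' (q with ';' replaced
    by a blank) is a candidate: lambda is -1 and the sets are merged.
    """
    def index_vars(s):
        names = s.partition(";")[2].strip().strip("{}")
        return {n.strip() for n in names.split(",") if n.strip()}

    if q.endswith(":A;"):
        q_prime = q[:-1] + " "
        merged = set()
        for s in Q:
            if s.startswith(q) or s.startswith(q_prime):
                merged |= index_vars(s)
        return -1, merged
    for lam, s in enumerate(Q, start=1):
        if s.startswith(q):
            return lam, index_vars(s)
    raise ValueError("path not found in Q")

# Hypothetical compiler output for a loop with two conditions c1 and c2,
# where c2 is only reached when c1 is not false.
Q = [
    "c1:T c2:T; {r, t}",
    "c1:T c2:F; {r, u}",
    "c1:T c2:A; {r, t, u}",
    "c1:F; {v}",
    "c1:A c2:T; {r, t, v}",
    "c1:A c2:F; {r, u, v}",
    "c1:A c2:A; {r, t, u, v}",
]

decided = path("c1:T c2:F;", Q)
undecided = path("c1:A;", Q)
print(decided[0], sorted(decided[1]))      # 2 ['r', 'u']
print(undecided[0], sorted(undecided[1]))  # -1 ['r', 't', 'u', 'v']
```

In the undecided case the merged set would be recomputed on every call, which corresponds to the remark that S_(λ)(i) for λ = −1 cannot be precalculated.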

Nested Indexing of Indirect Index Variables

The case in which one or more of the indirect index variables, for example, i_(k), is further indirectly indexed as i_(k)(l) where l(i), in turn, is indirectly indexed to i, is handled by treating i_(k)(l) as another indirect index variable, for example, i_(t)(i). Indeed, l, instead of being an array, can be any function of i.
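
A minimal Python sketch of this flattening step is given below. The function name compose_index and all numeric values are hypothetical and are not taken from the examples above; the point is only that the composed table i_(t) can be built once and then treated like any other indirect index variable.

```python
def compose_index(outer, inner, iterations):
    """Return the table i_t with i_t(i) = outer(inner(i)) for each iteration i."""
    return {i: outer[inner[i]] for i in iterations}

# Hypothetical values chosen for illustration.
l = {5: 2, 6: 1, 7: 3}        # l(i): first level of indirection
i_k = {1: 4, 2: 1, 3: 2}      # i_k(l): second level of indirection
i_t = compose_index(i_k, l, [5, 6, 7])
print(i_t)   # {5: 1, 6: 4, 7: 2}
```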

Use of Bit Vectors Instead of Prime Numbers

Instead of defining S_(λ)(i), where λ is a decision path in the loop, in terms of the product of prime numbers, one may use a binary bit vector. Here one associates a binary bit, in place of a prime number, for each number in the range [M₁, M₂]. That is, the k-th bit of a bit vector S_(λ)(i) when set to 1 denotes the presence of the prime number p(k) in S_(λ)(i). Alternatively, the notation b_(λi) may be used for the k-th bit of this bit vector. If a logical AND operation between any two bit vectors S_(α)(i) and S_(β)(j) produces a null bit vector, then the decision paths corresponding to S_(α)(i) and S_(β)(j) do not share common values of the indirect index variables. This is equivalent to the expression GCD(S_(α)(i), S_(β)(j))=1 described above.
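
The equivalence between the two representations can be checked with a short Python sketch that uses plain integers as bit vectors. The index values are those of Table 16; the function names bit_pattern and prime_pattern are hypothetical, and building each bit vector by a bitwise OR of single-bit masks is an assumption of this sketch.

```python
from itertools import combinations
from math import gcd, prod

M1 = 1                                   # lower end of the range [M1, M2]
p = {1: 3, 2: 5, 3: 7, 4: 11}            # primes used in the examples
# Index values (r, t, u, v) per iteration, as in Table 16.
values = {5: (1, 1, 1, 1), 6: (2, 2, 2, 2), 7: (3, 2, 2, 3),
          8: (4, 1, 4, 3), 9: (4, 4, 1, 1)}

def bit_pattern(vals):
    """Set bit (k - M1) for every index value k that the iteration uses."""
    bits = 0
    for k in vals:
        bits |= 1 << (k - M1)
    return bits

def prime_pattern(vals):
    """Product of the primes assigned to the index values."""
    return prod(p[k] for k in vals)

B = {i: bit_pattern(vs) for i, vs in values.items()}
P = {i: prime_pattern(vs) for i, vs in values.items()}

# A zero AND result must coincide with GCD == 1 for every pair.
agree = all(((B[i] & B[j]) == 0) == (gcd(P[i], P[j]) == 1)
            for i, j in combinations(values, 2))
print(B)      # {5: 1, 6: 2, 7: 6, 8: 13, 9: 9}
print(agree)  # True
```

Unlike the prime products, which grow with every repeated factor, the bit vector needs only one bit per value in [M₁, M₂].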

Computer Hardware and Software

FIG. 3 is a schematic representation of a computer system 300 that is provided for executing computer software programmed to assist in performing run-time parallelization of loops as described herein. This computer software executes on the computer system 300 under a suitable operating system installed on the computer system 300.

The computer software is based upon a computer program comprising a set of programmed instructions that are able to be interpreted by the computer system 300 for instructing the computer system 300 to perform predetermined functions specified by those instructions. The computer program can be an expression recorded in any suitable programming language comprising a set of instructions intended to cause a suitable computer system to perform particular functions, either directly or after conversion to another programming language.

The computer software is programmed using statements in an appropriate computer programming language. The computer program is processed, using a compiler, into computer software that has a binary format suitable for execution by the operating system. The computer software is programmed in a manner that involves various software components, or code means, that perform particular steps in accordance with the techniques described herein.

The components of the computer system 300 include: a computer 320, input devices 310, 315 and video display 390. The computer 320 includes: processor 340, memory module 350, input/output (I/O) interfaces 360, 365, video interface 345, and storage device 355. The computer system 300 can be connected to one or more other similar computers, using an input/output (I/O) interface 365, via a communication channel 385 to a network 380, represented as the Internet.

The processor 340 is a central processing unit (CPU) that executes the operating system and the computer software executing under the operating system. The memory module 350 includes random access memory (RAM) and read-only memory (ROM), and is used under direction of the processor 340.

The video interface 345 is connected to video display 390 and provides video signals for display on the video display 390. User input to operate the computer 320 is provided from input devices 310, 315 consisting of keyboard 310 and mouse 315. The storage device 355 can include a disk drive or any other suitable non-volatile storage medium.

Each of the components of the computer 320 is connected to a bus 330 that includes data, address, and control buses, to allow these components to communicate with each other via the bus 330.

The computer software can be provided as a computer program product recorded on a portable storage medium. In this case, the computer software is accessed by the computer system 300 from the storage device 355. Alternatively, the computer software can be accessed directly from the network 380 by the computer 320. In either case, a user can interact with the computer system 300 using the keyboard 310 and mouse 315 to operate the computer software executing on the computer 320.

The computer system 300 is described only as an example for illustrative purposes. Other configurations or types of computer systems can be equally well used to implement the described techniques.

Various alterations and modifications can be made to the techniques and arrangements described herein, as would be apparent to one skilled in the relevant art.

Conclusion

Techniques and arrangements are described herein for performing run-time parallelization of loops in computer programs having indirect loop index variables and embedded conditional statements. Various alterations and modifications can be made to the techniques and arrangements described herein, as would be apparent to one skilled in the relevant art.

1. A method for detecting cross-iteration dependencies between variables in a loop of a computer program, the method comprising the steps of: associating unique values with each of the values of indirect loop index variables of the loop; calculating for each iteration of the loop an indirectly indexed access pattern based upon the associated unique values; and determining whether cross-iteration dependencies exist between any two iterations of the loop based upon the indirectly indexed access pattern of the two iterations.
2. The method as claimed in claim 1, wherein the unique values associated with each of the values of the indirect loop index variables of the loop are different binary bit patterns of a bit vector.
3. The method as claimed in claim 2, wherein the indirectly indexed access pattern for an iteration is calculated by forming the logical AND of the unique bit patterns associated with each of the values of the indirect loop index variables of the loop for that iteration.
4. The method as claimed in claim 3, wherein the existence of cross-iteration dependencies is determined by determining whether the indirectly indexed access pattern for the two iterations have any common bit positions that share a value of one.
5. The method as claimed in claim 1, wherein the unique values associated with each of the values of the indirect loop index variables of the loop are different prime numbers.
6. The method as claimed in claim 5, wherein the indirectly indexed access pattern for an iteration is calculated by forming the product of the unique prime numbers associated with each of the values of the indirect loop index variables of the loop for that iteration.
7. The method as claimed in claim 6, wherein the existence of cross-iteration dependencies is determined by determining whether a greatest common divisor between the two indirectly indexed access patterns for the two corresponding iterations is greater than one.
8. The method as claimed in claim 1, further comprising the step of grouping iterations in a wave for execution in a common time period such that no cross-iteration dependencies exist between any of the grouped iterations of the wave.
9. The method as claimed in claim 8, further comprising the step of executing each of said waves in a prescribed sequence, and executing each of said iterations in each of said waves in parallel with each other.
10. A method for assisting in scheduling parallel computation of instructions in a loop of a computer program, the method comprising the steps of: determining, for a loop, active array variables, and direct and indirect loop index variables; determining, for each iteration of the loop, values of the indirect loop index variables; associating a unique value with each of values of the indirect loop index variables; calculating an indirectly indexed access pattern for each iteration of the loop; and determining whether cross-iteration dependencies exist between any two iterations of the loop based upon the indirectly indexed access patterns of the two iterations.
11. A method for detecting cross-iteration dependencies between variables in a loop of a computer program, the method comprising the steps of: determining, for a loop, Boolean conditions embedded in the loop; determining possible decision paths an iteration is allowed to take in the loop body in the presence of Boolean conditions; determining a first set V={v₁, v₂, . . . v_(n)} of active array variables of the loop, and a second set I={i₁, i₂, . . . i_(r)} of indirect loop index variables that appear with the active array variables of said first set in the loop; determining, for each decision path λ in the set of possible decision paths in the loop, the set V_(λ) of active array variables associated with the path, and the set I_(λ) of indirect loop index variables that appear with the active array variables V_(λ); determining the range [N₁, N₂] of the direct loop index i of the loop, and the maximal range [M₁, M₂] of the set of indirect loop index variables i₁, i₂, . . . i_(r) of the loop; associating, for each value l of the indirect loop index variables in the range [M₁, M₂], a unique value p(l) with each value l of the indirect loop index variables in the range [M₁, M₂]; determining, for each pair (i, λ) of a value of the direct loop index i and a decision path λ in the set of all possible decision paths in the loop, the value of S_(λ)(i) where S_(λ)(i) is the product of the unique values p(l) associated with each value l of the indirect loop index variables in the range [M₁, M₂]; and determining, for any pair S_(α)(i) and S_(β)(j), whether the values of S_(α)(i) and S_(β)(j) indicate that cross-iteration dependencies exist between iterations i and j.
12. The method as claimed in claim 11, wherein the unique values p(l) are different prime numbers and the values of S_(α)(i) and S_(β)(j) indicate that no cross-iteration dependencies exist between iterations i and j if the greatest common divisor of S_(α)(i) and S_(β)(j) is 1 when i≠j.
13. The method as claimed in claim 11, wherein the unique values p(l) are different bit patterns in a bit vector, and the values of S_(α)(i) and S_(β)(j) indicate that no cross-iteration dependencies exist between iterations i and j if S_(α)(i) and S_(β)(j) do not share any common bits that are set to 1 when i≠j.
14. The method as claimed in claim 11, further comprising the step of: grouping loop iterations in waves such that any two iterations i and j of a particular wave have no cross-dependencies.
15. The method as claimed in claim 14, further comprising the step of: executing the waves in which loop iterations are grouped in a predetermined sequence such that the wave having the lowest value of the direct loop index i is executed first and completely, followed by the wave with the next lowest value of i, and so on for successive values of i.
16. The method as claimed in claim 14, further comprising the step of: executing the iteration in each wave in parallel using multiple computing processors.
17. A computer program product for detecting cross-iteration dependencies between variables in a loop of a computer program, the computer program product comprising computer software stored on a computer-readable medium for performing the steps of: associating unique values with each of the values of indirect loop index variables of the loop; calculating for each iteration of the loop an indirectly indexed access pattern based upon the associated unique values; and determining whether cross-iteration dependencies exist between any two iterations of the loop based upon the indirectly indexed access pattern of the two iterations.
18. A computer system for detecting cross-iteration dependencies between variables in a loop of a computer program, the computer system executing computer software stored on a computer-readable medium for performing the steps of: associating unique values with each of the values of indirect loop index variables of the loop; calculating for each iteration of the loop an indirectly indexed access pattern based upon the associated unique values; and determining whether cross-iteration dependencies exist between any two iterations of the loop based upon the indirectly indexed access pattern of the two iterations.